In this data analysis report, we explore the relationship between GDP per capita and life expectancy in different countries. We begin by conducting exploratory data analysis to understand the correlation between these two variables. We then move on to principal component analysis to identify the leading components that explain the variance in GDP and life expectancy data.
We utilize linear discriminant analysis and clustering techniques to investigate the data from a classification standpoint. By applying these methods, we aim to uncover the underlying clustering patterns present within the continents.
Additionally, we perform hypothesis testing to determine if GDP and life expectancy are equal for Asia and Europe. Lastly, we use linear discriminant analysis and linear regression models to predict life expectancy based on GDP values.
gap.raw <- read.csv('gap.csv')
gap <- gap.raw
gap[,3:14] <- log(gap.raw[,3:14])
gdp <- exp(gap[,3:14])
years <- seq(1952, 2007,5)
colnames(gdp) <- years
rownames(gdp) <- gap[,2]
lifeExp <- gap[,15:26]
colnames(lifeExp) <- years
rownames(lifeExp) <- gap[,2]
I start my analysis with the relation between both variables life expectancy and GDP. It is expected that as a country has a higher GDP it will have more resources to invest in health and a healthy lifestyle. People living in poor countries will not have access to healthy expensive food and expensive health treatments, all those things can increase or decrease the life expectancy of people. As we see on the next plot Kuwait and Saudi Arabia have a behavior different from other countries, that is why I remove them to see a better plot. However, as removing outliers requires further analysis out of the scope of this work, I will keep them for the rest of this course work.
library(tidyr)
library(ggplot2)
library(ggrepel)
x=pivot_longer(as.data.frame(t(lifeExp)), Algeria:"New Zealand")
y=pivot_longer(as.data.frame(t(gdp)), Algeria:"New Zealand")
data = data.frame(x[,c("name","value")],y[,"value"])
colnames(data)=c("country","LifeExp","GDP")
ggplot(data, aes(x=GDP, y=LifeExp)) + geom_point() + geom_smooth() +
geom_text_repel(aes(label=country), force = 3, max.overlaps = 50)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: ggrepel: 1678 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
#library(car)
x=pivot_longer(as.data.frame(t(lifeExp)), Algeria:"New Zealand")
y=pivot_longer(as.data.frame(t(gdp)), Algeria:"New Zealand")
data = data.frame(x[,c("name","value")],y[,"value"])
colnames(data)=c("country","LifeExp","GDP")
data = data[!(data$country=="Kuwait"|data$country=="Saudi Arabia"),]
ggplot(data, aes(x=GDP, y=LifeExp)) + geom_point() + geom_smooth() +
geom_text_repel(aes(label=country), force = 3, max.overlaps = 50)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: ggrepel: 1632 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# par(mfrow=c(4,3))
# for(i in rownames(gdp)){
# plot(x[,i], y[,i], ylab="Life Exp.", xlab="GDP")
# }
It is very interesting that if we apply a variance-stabilizing transformation to GDP, in this case logarithmic , we obtain a much linear curve. Also as we see on the next plot the variance is not constant (heteroskedasticity).
x=pivot_longer(as.data.frame(t(lifeExp)), Algeria:"New Zealand")
y=pivot_longer(as.data.frame(t(gdp)), Algeria:"New Zealand")
data = data.frame(x[,c("name","value")],y[,"value"])
colnames(data)=c("country","LifeExp","GDP")
data = data[!(data$country=="Kuwait"|data$country=="Saudi Arabia"),]
ggplot(data, aes(x=log(GDP), y=LifeExp)) + geom_point() + geom_smooth(method = lm) +
geom_text_repel(aes(label=country), force = 3, max.overlaps = 50)
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 1578 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
From this analysis I conclude that GDP could be a good predictor of life expectancy as they are highly correlated.
To carry out the PCA with R we can use either prcomp and princomp, however, prcomp is more recommended because it works doing singular value decomposition and this method has higher numerical accuracy.
gdp.pca = prcomp(log(gdp), scale = TRUE)
lifeExpectancy.pca = prcomp(lifeExp, scale = TRUE)
Here I calculate the proportion of variation explained by each of the principal components, and provide a scree plot.
summary(gdp.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.3592 0.73033 0.32047 0.18149 0.13622 0.09627 0.07515
## Proportion of Variance 0.9404 0.04445 0.00856 0.00274 0.00155 0.00077 0.00047
## Cumulative Proportion 0.9404 0.98481 0.99337 0.99612 0.99766 0.99844 0.99891
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.06154 0.05953 0.04728 0.04439 0.03990
## Proportion of Variance 0.00032 0.00030 0.00019 0.00016 0.00013
## Cumulative Proportion 0.99922 0.99952 0.99970 0.99987 1.00000
screeplot(gdp.pca, type="lines",col="blue", main="Variance explained by PC (GDP)")
title(xlab="Principal Components")
summary(lifeExpectancy.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.3284 0.8219 0.39169 0.21950 0.1250 0.1098 0.08778
## Proportion of Variance 0.9232 0.0563 0.01279 0.00402 0.0013 0.0010 0.00064
## Cumulative Proportion 0.9232 0.9795 0.99229 0.99630 0.9976 0.9986 0.99925
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.06709 0.04413 0.03673 0.02815 0.02024
## Proportion of Variance 0.00038 0.00016 0.00011 0.00007 0.00003
## Cumulative Proportion 0.99963 0.99979 0.99990 0.99997 1.00000
screeplot(lifeExpectancy.pca, type="lines",col="blue", main="Variance explained by PC (Life Expectancy)")
title(xlab="Principal Components")
As we see in the R output PC1 explains 94 percent of the variance, PC2 explains 4.4% of the variance, PC3 0.85% of the variance, and so on. If I want to retain more than 90% of the data it would be enough to keep just PC1, but with adding PC2 I retain 98% of the variability of the data. Heuristically analyzing the screeplot using the elbow rule I would say that until 3 PCs it is still worth it to retain because when I pass from 2 to three PCs the variance still changes, but passing from three to four and so on the changes are not significant to my eyes.
The screeplot in this case is pretty similar to the one obtained for GDP. 92.32 %, 5.63%, and 1.279 % of the variance are explained by PC1, PC2, and PC3 respectively. For this data, I also think that it is worth it to retain three PCs at most.
Something very interesting is that I can calculate the correlation between PCs and variables and then understand easier the meaning of each PCs. This is what I do in the next code chunk and I also show the Rotation matrix. Conclusions are identical using the rotation matrix though.
GDP PC1 represents basically GDP, PC1 has a high negative correlation with all the GDP between 1952 and 2007. As PC1 increases the GDP becomes lower. PC2 is representing GDP between 1952 and 1977 positively and between 1982 to 2007 negatively. On the other hand countries with a high PC3 will have high values of GDP between 1952 and 1962 and between 1997 and 2007.
Life Expectancy PC1 has a high positive correlation to life expectancy, this means that as a country has a high PC1, its life expectancy will be high as well. PC2 represents positively the values of life expectancy between 1952 and 1982, and conversely negatively from 1987 and 2007. PC3 represents low values of life expectancy from 1952 to 1967 and between 2002 and 2007, but the opposite for the years between 1972 and 1997.
print("GDP rotation:")
[1] "GDP rotation:"
gdp.pca$rotation[,1:3]
## PC1 PC2 PC3
## 1952 -0.2807487 0.40051233 0.43574977
## 1957 -0.2848414 0.37055497 0.28293152
## 1962 -0.2888408 0.31874715 0.10896406
## 1967 -0.2910711 0.24625845 -0.11432067
## 1972 -0.2930171 0.16692054 -0.27940184
## 1977 -0.2944425 0.05722029 -0.36986452
## 1982 -0.2941942 -0.04581450 -0.40967555
## 1987 -0.2932798 -0.14621010 -0.30610546
## 1992 -0.2910449 -0.24525213 -0.06508071
## 1997 -0.2871994 -0.33718777 0.14420739
## 2002 -0.2840643 -0.38710054 0.25995158
## 2007 -0.2808777 -0.40215905 0.36894940
cor(log(gdp),gdp.pca$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1952 -0.9430974 0.29250776 0.13964394 0.05725404 0.0101602689 -0.026886032
## 1957 -0.9568454 0.27062889 0.09067055 0.03226023 0.0151704111 0.002979152
## 1962 -0.9702806 0.23279188 0.03491952 -0.02343792 0.0089531003 0.024451045
## 1967 -0.9777725 0.17985091 -0.03663614 -0.07999819 -0.0272921424 0.041908503
## 1972 -0.9843095 0.12190774 -0.08953940 -0.05009606 -0.0570389038 -0.008198222
## 1977 -0.9890979 0.04178992 -0.11852981 -0.02433137 0.0071909926 -0.066601827
## 1982 -0.9882638 -0.03345988 -0.13128798 0.02390850 0.0539393316 -0.004764855
## 1987 -0.9851922 -0.10678221 -0.09809706 0.05496732 0.0534204639 0.037494612
## 1992 -0.9776844 -0.17911597 -0.02085630 0.08317054 -0.0524172159 0.013390365
## 1997 -0.9647666 -0.24625969 0.04621388 0.03364496 -0.0580195297 -0.005921415
## 2002 -0.9542352 -0.28271267 0.08330621 -0.03034894 -0.0001959895 -0.003526780
## 2007 -0.9435305 -0.29371042 0.11823655 -0.07749828 0.0470637365 -0.004763317
## PC7 PC8 PC9 PC10 PC11
## 1952 0.008782268 -0.033852799 0.010181361 -0.0026470059 -0.0055072634
## 1957 0.003604402 0.032315919 -0.001056862 -0.0001880523 0.0153835923
## 1962 -0.021628834 0.022098960 -0.012610536 0.0088033326 -0.0154063239
## 1967 -0.014810550 -0.027653594 -0.011183383 -0.0102816785 0.0058569204
## 1972 0.038497121 0.008761483 0.027521005 0.0093382743 0.0002343443
## 1977 -0.011023208 0.006878510 -0.017291106 -0.0176042027 -0.0053997734
## 1982 -0.015626522 -0.013444835 0.002693047 0.0304410341 0.0105202212
## 1987 0.011823888 0.003593338 0.017406019 -0.0244498912 -0.0043296402
## 1992 0.023383443 -0.002823566 -0.032767219 0.0070424296 -0.0087635373
## 1997 -0.034081572 0.003160905 0.012276381 -0.0062930720 0.0205680231
## 2002 -0.017857197 0.002332744 0.019128104 0.0058902271 -0.0250849119
## 2007 0.029251169 -0.001689084 -0.014071677 -0.0000235639 0.0120149131
## PC12
## 1952 -0.006153628
## 1957 0.020268310
## 1962 -0.022063082
## 1967 0.011061548
## 1972 -0.004719186
## 1977 0.001789999
## 1982 0.002280300
## 1987 -0.004208394
## 1992 0.004074155
## 1997 -0.011301594
## 2002 0.016766348
## 2007 -0.007747656
print("Life Expectancy rotation:")
## [1] "Life Expectancy rotation:"
lifeExpectancy.pca$rotation[,1:3]
## PC1 PC2 PC3
## 1952 0.2837703 0.34441629 -0.32952972
## 1957 0.2880971 0.31933803 -0.24260282
## 1962 0.2898354 0.29660608 -0.18199203
## 1967 0.2937536 0.24074227 -0.05922854
## 1972 0.2956429 0.17850314 0.10463448
## 1977 0.2949868 0.11324248 0.26337143
## 1982 0.2969332 0.02533604 0.33250651
## 1987 0.2954179 -0.07504642 0.39549540
## 1992 0.2877706 -0.24266370 0.40007959
## 1997 0.2849288 -0.37063340 0.01000402
## 2002 0.2774799 -0.43388258 -0.34334220
## 2007 0.2744526 -0.44497039 -0.41302205
cor(lifeExp,lifeExpectancy.pca$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1952 0.9445114 0.28308016 -0.129073360 -0.081516551 -3.091603e-02 0.055308506
## 1957 0.9589126 0.26246802 -0.095024996 -0.038755272 -3.450243e-05 0.002216994
## 1962 0.9646986 0.24378433 -0.071284382 -0.019571044 1.245202e-02 -0.038799445
## 1967 0.9777400 0.19786915 -0.023199203 0.026140632 1.949406e-02 -0.050944635
## 1972 0.9840284 0.14671402 0.040984237 0.076103503 3.661604e-02 -0.017408759
## 1977 0.9818447 0.09307544 0.103159845 0.102333414 3.846571e-02 0.065726354
## 1982 0.9883232 0.02082401 0.130239335 0.030117317 -6.088414e-02 -0.005318526
## 1987 0.9832796 -0.06168162 0.154911429 -0.004043822 -6.581306e-02 -0.011403151
## 1992 0.9578261 -0.19944841 0.156707006 -0.121255827 4.005493e-02 -0.003383970
## 1997 0.9483673 -0.30462833 0.003918472 -0.057458491 3.239171e-02 0.006503466
## 2002 0.9235740 -0.35661364 -0.134483564 0.030642948 2.933252e-03 0.004283890
## 2007 0.9134979 -0.36572687 -0.161776436 0.053264933 -2.468057e-02 -0.005409316
## PC7 PC8 PC9 PC10 PC11
## 1952 -0.0059278336 0.0110563872 0.0066469564 -0.0154678213 0.0007690448
## 1957 0.0018845639 0.0121402617 -0.0098842381 0.0280745636 -0.0008372452
## 1962 0.0064981827 -0.0521319552 0.0014017756 -0.0055185955 0.0011437564
## 1967 0.0008316197 0.0214847575 0.0021788444 -0.0015375341 -0.0030969670
## 1972 -0.0074500924 0.0259279203 0.0068926167 -0.0107843609 0.0029004930
## 1977 -0.0033764876 -0.0182190063 0.0009129623 0.0057714146 -0.0022045050
## 1982 0.0140678450 0.0011489704 -0.0304045534 -0.0074229448 0.0001554075
## 1987 0.0041032175 -0.0025539314 0.0279861569 0.0085457714 0.0012303556
## 1992 -0.0430490938 -0.0006630837 -0.0060830910 -0.0003850919 0.0046768283
## 1997 0.0565006235 0.0045132268 0.0020942458 -0.0027733617 -0.0128406307
## 2002 0.0187732112 0.0011991897 -0.0006707001 0.0017114280 0.0205868393
## 2007 -0.0440835909 -0.0041617339 -0.0010622661 -0.0002048975 -0.0124437763
## PC12
## 1952 -0.0016576278
## 1957 0.0047891612
## 1962 0.0026764063
## 1967 -0.0140152253
## 1972 0.0121587841
## 1977 -0.0046889737
## 1982 0.0005064825
## 1987 0.0004149916
## 1992 -0.0009218386
## 1997 0.0017238358
## 2002 -0.0022175237
## 2007 0.0012275826
gdp_pca = data.frame(PC1=gdp.pca$x[,1], PC2=gdp.pca$x[,2], PC3=gdp.pca$x[,3], continent=gap[,1], country=gap[,2])
lifeExpectancy_pca = data.frame(PC1=lifeExpectancy.pca$x[,1], PC2=lifeExpectancy.pca$x[,2], PC3=lifeExpectancy.pca$x[,3], continent=gap[,1], country=gap[,2])
ggplot(gdp_pca, aes(x=PC1, y=PC2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("GDP, PC1 and PC2")
ggplot(gdp_pca, aes(x=PC2, y=PC3, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("GDP, PC2 and PC3")
Kuwait and Congo Dem. are particularly interesting as Kuwait has PC3 and PC2 high, but Congo Dem has high PC2 and low PC3. In my previous interpretations of the PCs I said that if PC2 is high also would be GDP between 1952 and 1977, but low between 1982 and 2007, and indeed it is what I see in the next plot for Kuwait, although a high PC3 would mean a high value between 1997 and 2007 this is opposite to PC2, and as PC2 has a stronger correlation it justifies a positive trend between 1997 and 2007. Another important thing is that as PC1 is higher for Congo Rep. Dem. than for Kuwait, the opposite happens for GDP. The total GDP is 131.4858 and 76.51869 for Kuwait and Congo Rep. Dem. which is consistent.
gdp = gap[,3:14] # gdp
lifeExp = gap[,15:26] # life exp
rownames(gdp) = gap[,2]
rownames(lifeExp) = gap[,2]
#par(mfrow=c(4,3))
plot(as.numeric(gdp["Kuwait",]), x=years, ylab="GDP", main=paste("GDP","Kuwait"), type="l")
plot(as.numeric(gdp["Congo Dem. Rep.",]), x=years, ylab="GDP", main=paste("GDP","Congo Dem. Rep."), type="l")
sum(gdp["Kuwait",])
## [1] 131.4858
sum(gdp["Congo Dem. Rep.",])
## [1] 76.51869
ggplot(lifeExpectancy_pca, aes(x=PC1, y=PC2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("Life expectancy, PC1 and PC2")
ggplot(lifeExpectancy_pca, aes(x=PC2, y=PC3, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("Life expectancy, PC2 and PC3")
Analyzing Life Expectancy for Zimbabwe and Oman, I see that Oman has a slightly higher PC1 and then also Life Expectancy than Zimbabwe. As Zimbabwe has a high PC2 it means a high value in its figure the first half of the graph (1952 to 1982) and a lower value for the second half(1987 to 2007), however, Oman had the opposite value for this figure for both periods. Both countries have a high value for PC3, and this value affects negatively at the beginning and at the end of the series (between 1952 and 1967 and between 2002 to 2007), but positively between 1972 and 1997. This is coincident with the series of the next plots.
gdp = gap[,3:14] # gdp
lifeExp = gap[,15:26] # life exp
rownames(gdp) = gap[,2]
rownames(lifeExp) = gap[,2]
plot(as.numeric(lifeExp["Zimbabwe",]), x=years, ylab="Life exp.", main=paste("Life exp.","Zimbabwe"), type="l")
plot(as.numeric(lifeExp["Oman",]), x=years, ylab="Life exp.", main=paste("Life exp.","Oman"), type="l")
sum(lifeExp["Zimbabwe",])
## [1] 631.958
sum(lifeExp["Oman",])
## [1] 701.312
lifeExpGdp = gap[,3:26]
rownames(lifeExpGdp) <- gap[,2]
lifeExpGdp.dist=dist(lifeExpGdp, upper = TRUE)
lifeExpGdp.cmd=cmdscale(lifeExpGdp.dist, eig = FALSE, k = 2)
lifeExpGdp_cmd=data.frame(cmd1=lifeExpGdp.cmd[,1], cmd2=lifeExpGdp.cmd[,2], continent=gap[,1], country=gap[,2])
ggplot(lifeExpGdp_cmd, aes(x=cmd1, y=cmd2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("CMD, Life Expectancy and GDP")
Doing a Comparison of the plots obtained with CMD and PCA, I see that CMD has a huge advantage in terms of compactness, as we are reducing all the data in one graph with just two dimensions, and also the level of separation of the clusters is similar. However, I see that PCA is superior is in terms of interpretability and still having compact results.
In Hotelling's two sample T2-test, the null hypothesis (H0) states that there is no difference between the means of two independent samples. The alternative hypothesis (H1) states that there is a significant difference between the means of the two samples.
The p-value in a statistical test represents the probability of observing a test statistic as extreme as the one calculated, given that the null hypothesis is true. A p-value less than the alpha level (typically 0.05) indicates that there is enough evidence to reject the null hypothesis.
Therefore, in Hotelling's two sample T2-test, if the calculated p-value is less than 0.05, we would reject the null hypothesis and conclude that there is a significant difference between the means of the two samples.
X1=gap%>%filter(continent=="Asia")%>%select(gdpPercap_2007,lifeExp_2007)
X2=gap%>%filter(continent=="Europe")%>%select(gdpPercap_2007,lifeExp_2007)
HotellingsT2(X1, X2)
##
## Hotelling's two sample T2-test
##
## data: X1 and X2
## T.2 = 12.681, df1 = 2, df2 = 60, p-value = 2.55e-05
## alternative hypothesis: true location difference is not equal to c(0,0)
print(paste("Critical value for α=0.05 is",qf(0.95,2,60)))
## [1] "Critical value for α=0.05 is 3.15041131058273"
\(H_0: μ_1=μ_2\) vs \(H_1: μ_2≠μ_2\) The hypothesis \(H_0\) is that GDP and life expectancy in 1952 were equal for Asia and Europe.In this case, the p-value is very low then there is huge evidence of a difference between Asia and Europe in terms of GDP and life expectancy in the year 1952. The critical value lower than the T.2=39 supports this hypothesis as well.
X1=gap%>%filter(continent=="Asia")%>%select(gdpPercap_1952,lifeExp_1952)
X2=gap%>%filter(continent=="Europe")%>%select(gdpPercap_1952,lifeExp_1952)
HotellingsT2(X1, X2)
##
## Hotelling's two sample T2-test
##
## data: X1 and X2
## T.2 = 39.347, df1 = 2, df2 = 60, p-value = 1.21e-11
## alternative hypothesis: true location difference is not equal to c(0,0)
print(paste("Critical value for α=0.05 is",qf(0.95,2,60)))
## [1] "Critical value for α=0.05 is 3.15041131058273"
For the year 1952, the p-value is even lower, then I could say that they were more different then.
For the linear discriminant analysis First I use 30% of the data for the test set (42 examples approximately) and the rest for the training set.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
set.seed(123) # so that I get the same results each time.
test.index <- sample(1:142, size=42)
gap.test <- dplyr::select(gap, -country)[test.index,]
gap.train <- dplyr::select(gap, -country)[-test.index,]
gap.lda<-lda(continent ~ ., gap.train)
gap.pred <- predict(gap.lda, gap.test)
print(paste("The predictive accuracy is ",
sum(gap.pred$class == gap.test$continent)/dim(gap.test)[1]*100, "%"))
## [1] "The predictive accuracy is 57.1428571428571 %"
The quite low predictive accuracy obtained in this case may be explained by the highly unbalanced data, where, as we see in the next R output, Oceania represents just 0.9% and Africa 36.8% of the data.
gap.lda
## Call:
## lda(continent ~ ., data = gap.train)
##
## Prior probabilities of groups:
## Africa Americas Asia Europe Oceania
## 0.38 0.20 0.21 0.20 0.01
##
## Group means:
## gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967
## Africa 6.811431 6.896881 6.996266 7.134151
## Americas 8.061785 8.170572 8.255846 8.385747
## Asia 7.486033 7.629678 7.732015 7.833065
## Europe 8.440622 8.669796 8.862741 9.078023
## Oceania 9.214292 9.301063 9.410602 9.583704
## gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987
## Africa 7.255926 7.297963 7.306340 7.251569
## Americas 8.540909 8.675896 8.683815 8.683559
## Asia 8.035114 8.167184 8.207409 8.201288
## Europe 9.305521 9.444677 9.527915 9.606729
## Oceania 9.728457 9.816523 9.876990 9.993734
## gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
## Africa 7.237518 7.278828 7.339940 7.460726
## Americas 8.721062 8.796606 8.818189 8.950151
## Asia 8.285280 8.423720 8.507930 8.704134
## Europe 9.515321 9.654124 9.804083 9.978058
## Oceania 10.061549 10.203516 10.331619 10.446839
## lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972
## Africa 39.11326 41.19161 43.28484 45.35403 47.53913
## Americas 52.05085 54.85880 57.36660 59.55180 61.60320
## Asia 46.48762 49.48419 51.54983 54.85820 57.54352
## Europe 64.54480 66.76775 68.78605 70.01170 71.05750
## Oceania 69.12000 70.33000 70.93000 71.10000 71.93000
## lifeExp_1977 lifeExp_1982 lifeExp_1987 lifeExp_1992 lifeExp_1997
## Africa 49.78050 51.74676 53.35676 53.65358 53.13255
## Americas 63.62155 65.55095 67.58430 69.19275 70.86310
## Asia 60.00283 62.51476 64.66881 66.21510 67.80638
## Europe 72.19850 73.12010 73.93275 74.57110 75.67085
## Oceania 73.49000 74.74000 76.32000 77.56000 78.83000
## lifeExp_2002 lifeExp_2007
## Africa 52.33692 53.82224
## Americas 72.24320 73.44930
## Asia 68.99995 70.54000
## Europe 76.93105 77.88905
## Oceania 80.37000 81.23500
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## gdpPercap_1952 -0.36150213 -1.44996671 6.01673022 1.93345865
## gdpPercap_1957 -0.30188638 2.83501245 -2.00963775 -4.18458786
## gdpPercap_1962 2.91908061 -5.12402027 -5.65144860 4.34535216
## gdpPercap_1967 -1.85018480 1.46782661 5.56565210 -1.54204562
## gdpPercap_1972 -1.94550293 2.72667850 -4.82904106 1.74041776
## gdpPercap_1977 2.20684661 -1.32238060 2.13900549 -3.72196862
## gdpPercap_1982 -2.10106879 1.07987750 -3.19438747 3.21563908
## gdpPercap_1987 2.86928446 2.32255546 1.41979461 -2.12055524
## gdpPercap_1992 -1.68799301 -2.76358361 2.77325600 -0.84987614
## gdpPercap_1997 -0.04424094 -1.68224695 -2.04000218 0.66252402
## gdpPercap_2002 0.21653260 1.89425339 4.90532571 -2.61462903
## gdpPercap_2007 0.28121048 -0.11918190 -4.72039347 2.95403657
## lifeExp_1952 -0.08756566 0.76457722 -0.64102214 -0.13227001
## lifeExp_1957 0.07809968 -0.88840067 0.82378694 1.22463806
## lifeExp_1962 0.15781643 0.04430716 0.49443440 -0.16046328
## lifeExp_1967 -0.14964503 0.39955938 -1.14871006 -1.93747242
## lifeExp_1972 0.10912394 0.01809590 0.08758833 0.91871801
## lifeExp_1977 0.15861272 -0.36937492 1.08525003 0.32980109
## lifeExp_1982 -0.16325452 0.40201905 -0.92099797 -0.89695407
## lifeExp_1987 -0.09602775 -0.47149532 0.27608810 0.91267463
## lifeExp_1992 0.04242214 0.09950723 -0.03822550 -0.13169316
## lifeExp_1997 -0.03104650 0.08066902 -0.09532618 -0.09088631
## lifeExp_2002 0.10060555 0.03714213 0.40587834 -0.04735895
## lifeExp_2007 0.03443748 -0.14497433 -0.33056848 0.08294817
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.7426 0.1739 0.0677 0.0157
The proportion of the trace shows that LD1 and LD2 are doing most of the separation and in the next table we see that we misclassify Africa as Asia two times, Americas as Africa 1, Americas as Asia 3 and America as Europe 1, and so on.
table(gap.pred$class, gap.test$continent)
##
## Africa Americas Asia Europe Oceania
## Africa 11 0 2 0 0
## Americas 2 2 6 2 0
## Asia 1 1 3 0 0
## Europe 0 2 1 8 1
## Oceania 0 0 0 0 0
Discussing the difference between this plot and the plot I found using PCA. For this graph, in order to do a better visual comparison with the previous graph, I do LDA with all the data instead of splitting it into training and test set. As the objective of PCA is to maximize the variance of the projected data on the PCA graph data looks more disperse, by contrast, as LDA maximizes the variance between groups minimizing variance within, the groups look more compact.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
gap.lda.full = lda(continent ~ ., gap[,-c(2)])
gap.pred.full <- predict(gap.lda.full, gap[,-c(2)])
lda_full=data.frame(LD1=gap.pred.full$x[,1], LD2=gap.pred.full$x[,2], continent=gap[,1], country=gap[,2])
lda=ggplot(lda_full, aes(x=LD1, y=LD2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("LDA, Life Expectancy and GDP")
pca=ggplot(gdp_pca, aes(x=PC1, y=PC2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("GDP, PC1 and PC2")
grid.arrange(lda,pca)
For this problem, as I know that there are 5 continents, 5 clusters are what I expect. Also looking at the plot made with the function fviz_nbclust of R with the method wss(total within sum of square) which is a heuristic method, the use of 5 clusters makes sense.
set.seed(123)
gap2 <- gap[,3:26]
gap.k <- kmeans(gap2, centers = 5, nstart=25)
table(gap.k$cluster, gap$continent)
##
## Africa Americas Asia Europe Oceania
## 1 2 11 8 10 0
## 2 6 8 14 1 0
## 3 23 2 6 0 0
## 4 0 4 4 19 2
## 5 21 0 1 0 0
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(gap2, kmeans, method = "wss")
fviz_cluster(gap.k, gap2, ellipse.type = "norm")
gap.scaled <- gap
gap.scaled[,3:26] <- scale(gap[,3:26])
In this section, hierarchical clustering single, complete, ward, and average are applied to the data. On the next R outputs, we see that single clustering performs very bad as it is classifying most of the data in just one cluster. Complete performs slightly better, but still from the first to the fourth clusters contain elements from Africa, America, and Asia. Ward and average also aren’t performing better than k-means either, concluding that countries do not naturally cluster by continent.
gap.single <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="single")
plot(gap.single, labels=gap$continent,cex=0.7)
table(cutree(gap.single, k=5), gap$continent)
##
## Africa Americas Asia Europe Oceania
## 1 49 25 32 30 2
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 0 0 1 0 0
gap.complete <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="complete")
plot(gap.complete, labels=gap$continent,cex=0.7)
table(cutree(gap.complete, k=5), gap$continent)
##
## Africa Americas Asia Europe Oceania
## 1 9 11 9 3 0
## 2 31 1 5 0 0
## 3 8 0 7 0 0
## 4 4 8 6 4 0
## 5 0 5 6 23 2
gap.ward <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="ward.D2")
plot(gap.ward, labels=gap$continent,cex=0.7)
table(cutree(gap.ward, k=5), gap$continent)
##
## Africa Americas Asia Europe Oceania
## 1 11 9 10 1 0
## 2 23 1 7 0 0
## 3 16 0 5 0 0
## 4 2 6 4 2 0
## 5 0 9 7 27 2
gap.average <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="average")
plot(gap.average, labels=gap$continent,cex=0.7)
table(cutree(gap.average, k=5), gap$continent)
##
## Africa Americas Asia Europe Oceania
## 1 13 7 16 1 0
## 2 1 0 0 0 0
## 3 36 1 7 0 0
## 4 2 17 9 29 2
## 5 0 0 1 0 0
I compared Ordinary Least Squares, Principal Components and Ridge Regression. I trained these three models with the training set and then tested them with the test set obtaining the RMSEP, the model with the lowest RMSEP is then trained with the full dataset to predict the 2007 life expectancy from the GDP values.
gap4Regression.training=gap[-test.index,c(26,3:14)]
gap4Regression.test=gap[test.index,c(26,3:14)]
gap4Regression.full=gap[,c(26,3:14)]
ols=lm(lifeExp_2007~.,data=gap4Regression.training)
ols_predict = predict(ols, gap4Regression.test)
mean((ols_predict-gap4Regression.test$lifeExp_2007)^2)
## [1] 36.09367
Principal Components Regression Using the PCR function from the pls package R calculates this form me.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr_fit <- pcr(lifeExp_2007~., data=gap4Regression.training, validation='CV')
summary(pcr_fit)
## Data: X dimension: 100 12
## Y dimension: 100 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 12.11 8.292 7.882 8.018 8.352 8.340 8.396
## adjCV 12.11 8.280 7.867 7.993 8.302 8.286 8.337
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 8.523 8.510 8.727 8.706 8.607 8.826
## adjCV 8.456 8.429 8.648 8.630 8.519 8.724
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 94.76 98.70 99.4 99.65 99.78 99.87 99.90
## lifeExp_2007 54.86 59.71 60.4 61.54 62.18 62.40 62.86
## 8 comps 9 comps 10 comps 11 comps 12 comps
## X 99.93 99.96 99.97 99.99 100.00
## lifeExp_2007 63.26 63.29 64.08 65.86 65.86
plot(RMSEP(pcr_fit), legendpos = "topright")
As the lowest cross-validation error is with two components are minimizing the RMSEP, I am using 2 of them.
pcr_predict = predict(pcr_fit, gap4Regression.test[2:13], ncomp=2)
mean((pcr_predict-gap4Regression.test$lifeExp_2007)^2)
## [1] 29.54078
##Ridge Regression.
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-1
lambdas <- 10^seq(3,-2,by=-0.1)
cv_fit <- cv.glmnet(as.matrix(gap4Regression.training[,2:13]), gap4Regression.training$lifeExp_2007, alpha = 0, lambda = lambdas)
plot(cv_fit)
cv_fit$lambda.min
## [1] 1.995262
cv_fit$lambda.1se
## [1] 50.11872
ridge_mod = glmnet(as.matrix(gap4Regression.training[,2:13]), gap4Regression.training$lifeExp_2007, alpha=0, lambda=cv_fit$lambda.1se)
ridge_pred = predict(ridge_mod, newx = as.matrix(gap4Regression.test[,2:13]))
mean((ridge_pred - gap4Regression.test$lifeExp_2007)^2)
## [1] 56.06069
Finally, an RMSEP of 29.54078 for PCA regression is the lowest value obtained.
| Model | RMSEP |
|---|---|
| OLS | 36.09367 |
| PCR | 29.54078 |
| Ridge | 52.55654 |
Then I proceed to train the PCA regression model with the full data-set and obtain the predictions:
gap4Regression=gap[,c(26,3:14)]
pcr_fit <- pcr(lifeExp_2007~., data=gap4Regression.training, validation='CV')
pcr_predict = predict(pcr_fit, gap4Regression[2:13], ncomp=2)
rownames(pcr_predict)=gap[,2]
pcr_predict
## , , 2 comps
##
## lifeExp_2007
## Algeria 67.96314
## Angola 62.00690
## Benin 57.44125
## Botswana 74.23988
## Burkina Faso 56.01175
## Burundi 50.96851
## Cameroon 60.56682
## Central African Republic 52.57879
## Chad 56.02340
## Comoros 55.75138
## Congo Dem. Rep. 46.45654
## Congo Rep. 65.06096
## Cote d'Ivoire 59.26768
## Djibouti 59.98675
## Egypt 67.25404
## Equatorial Guinea 67.39622
## Eritrea 53.64180
## Ethiopia 51.69248
## Gabon 74.64548
## Gambia 53.06808
## Ghana 55.53803
## Guinea 55.06320
## Guinea-Bissau 53.33731
## Kenya 58.00297
## Lesotho 58.31557
## Liberia 50.53649
## Libya 73.14370
## Madagascar 54.61341
## Malawi 53.32421
## Mali 55.00017
## Mauritania 59.11304
## Mauritius 71.09131
## Morocco 64.21723
## Mozambique 51.22659
## Namibia 65.53491
## Niger 51.35890
## Nigeria 59.51248
## Reunion 69.05445
## Rwanda 53.98705
## Sao Tome and Principe 58.59562
## Senegal 57.89567
## Sierra Leone 53.83512
## Somalia 54.40399
## South Africa 70.06419
## Sudan 59.70138
## Swaziland 66.55392
## Tanzania 54.85076
## Togo 55.12275
## Tunisia 68.52896
## Uganda 53.97149
## Zambia 56.07777
## Zimbabwe 52.76645
## Argentina 71.77855
## Bolivia 63.99159
## Brazil 71.38798
## Canada 80.00019
## Chile 71.69024
## Colombia 68.82018
## Costa Rica 70.15309
## Cuba 68.52489
## Dominican Republic 66.40230
## Ecuador 69.20480
## El Salvador 66.84224
## Guatemala 66.90404
## Haiti 57.06665
## Honduras 63.76635
## Jamaica 69.34965
## Mexico 72.69034
## Nicaragua 61.11579
## Panama 70.65893
## Paraguay 65.99960
## Peru 68.03786
## Puerto Rico 76.66813
## Trinidad and Tobago 73.03540
## United States 81.08037
## Uruguay 70.54221
## Venezuela 71.35231
## Afghanistan 53.34948
## Bahrain 77.58911
## Bangladesh 55.94368
## Cambodia 55.83928
## China 64.20445
## Hong Kong China 81.40710
## India 59.69407
## Indonesia 64.06723
## Iran 71.64267
## Iraq 65.77304
## Israel 78.20614
## Japan 80.91343
## Jordan 65.85100
## Korea Dem. Rep. 61.39613
## Korea Rep. 77.95538
## Kuwait 78.07161
## Lebanon 70.45787
## Malaysia 72.83259
## Mongolia 62.21165
## Myanmar 51.59343
## Nepal 55.76550
## Oman 79.10824
## Pakistan 61.89211
## Philippines 62.77680
## Saudi Arabia 77.95701
## Singapore 82.80819
## Sri Lanka 63.55577
## Syria 65.58129
## Taiwan 79.62423
## Thailand 69.57438
## Vietnam 58.91653
## West Bank and Gaza 67.44063
## Yemen Rep. 62.05297
## Albania 65.86756
## Austria 80.53609
## Belgium 79.78290
## Bosnia and Herzegovina 68.46156
## Bulgaria 70.74783
## Croatia 73.73951
## Czech Republic 76.00763
## Denmark 79.97832
## Finland 79.31436
## France 79.46689
## Germany 79.81658
## Greece 78.36104
## Hungary 74.40764
## Iceland 80.35377
## Ireland 79.79269
## Italy 79.31342
## Montenegro 70.76260
## Netherlands 80.22076
## Norway 82.48104
## Poland 72.96436
## Portugal 77.31539
## Romania 71.12079
## Serbia 71.73820
## Slovak Republic 74.14363
## Slovenia 77.57260
## Spain 78.81542
## Sweden 79.39946
## Switzerland 80.20050
## Turkey 69.75724
## United Kingdom 79.03436
## Australia 79.35942
## New Zealand 77.10131
Accuracy of the model
mean((gap4Regression.full$lifeExp_2007-pcr_predict)^2)
## [1] 49.53311
Our analysis reveals a strong correlation between GDP per capita and life expectancy, with higher GDP generally associated with longer life expectancy. Principal component analysis helps us understand the variability in the data and identify key components that explain the variation in GDP and life expectancy. The interpretation of the leading principal components provides valuable insights into the relationship between the variables for different countries.
Hypothesis testing is conducted to compare GDP and life expectancy between Asia and Europe, providing statistical evidence to reject the null hypothesis that both continents have equal GDP and life expentancy in 2007. Linear discriminant analysis and clustering are employed to analyze the data from a classification perspective, revealing the clustering patterns within the continents.
Finally, linear regression models are trained to predict life expectancy in 2007 based on GDP values, with principal components regression yielding the most accurate results. Overall, this report highlights the importance of economic development in improving life expectancy and provides valuable insights into the relationship between GDP per capita and life expectancy.